Workshop Day 1B | 2022-07-25 Jeffrey M. Girard | Pitt Methods
Wrangle I
Tidy Data Principles
There are many ways to store data
We will be learning the tidy data format
Data should be rectangular
Each variable has its own column
Each observation has its own row
Each value has its own cell
Other Data Advice
Name all variables in the first row
This is called a header row
Avoid merged cells for data storage
These are okay for communication
Avoid empty cells whenever possible
Mark missing data as NA
Avoid formatting-as-data for storage
e.g., non-redundant color-coding
Tidying Example 1
Not Tidy
Name
Ann
Bob
Cat
Dom
Age
13
10
11
11
Weight
56.4
46.8
41.3
43.3
❌ Here, each row is a variable and each column is an observation.
Tidy
Name
Age
Weight
Ann
13
56.4
Bob
10
46.8
Cat
11
41.3
Dom
11
43.3
✔️ Here, each column is a variable and each row is an observation.
Tidying Example 2
Not Tidy
Names:
Ann
Bob
Cat
Dom
Age
Weight
13
56.4
10
46.8
11
41.3
11
43.3
❌ Here, we have data that is not rectangular because the Names variable has its own row.
Tidy
Name
Age
Weight
Ann
13
56.4
Bob
10
46.8
Cat
11
41.3
Dom
11
43.3
✔️ Here, we have made the data rectangular by moving the Names variable to its own column.
Tidying Example 3
Not Tidy
country
year
cases / population
Afghanistan
1999
NA / 19987071
2000
2666 / 20595360
Brazil
1999
37737 / 172006362
2000
80488 / 174504898
China
1999
212258 / 1272915272
2000
213766 / 1280428583
❌ Here, we have merged cells and two values stored in a single cell.
Tidy
country
year
cases
population
Afghanistan
1999
NA
19987071
Afghanistan
2000
2666
20595360
Brazil
1999
37737
172006362
Brazil
2000
80488
174504898
China
1999
212258
1272915272
China
2000
213766
1280428583
✔️ Here, we have un-merged the countries and separated the cases and populations variables into columns.
Tidying Example 4
Not Tidy
student
grade
Amber
91.5
A-
Bristol
86.2
B
Charlene
94.0
A
Diego
89.3
B+
Legend: Psych. Major, Psych. Minor
❌ Here, we have a missing variable name and formatting-as-data.
Tidy
student
psych
grade
letter
Amber
major
91.5
A-
Bristol
minor
86.2
B
Charlene
major
94.0
A
Diego
NA
89.3
B+
✔️ Here, we have added a column for the psych variable, removed the legend, and named the letter variable.
Tidying Example 5
Not Tidy
student
grade
letter
Amber
91.5
A-
Bristol*
94.2
A
Class Summary
As
2
Yay!
Bs
0
*Grade was revised.
❌ Here, we have two types of data in one file and a footnote as data.
Tidy
student
grade
letter
revised
Amber
91.5
A-
FALSE
Bristol
94.2
A
TRUE
letter
count
notes
A
2
Yay!
B
0
✔️ Here, we have split the data into two separate tables and added the revised and notes variables.
Long vs. Wide Format
Wide Format
date
Boeing
Amazon
Google
2009-01-01
$173.55
$174.90
$174.34
2009-01-02
$172.61
$171.42
$170.04
✔️ Here, we have a wide format where each observation is a date.
Long Format
date
stock
price
2009-01-01
Boeing
$173.55
2009-01-01
Amazon
$174.90
2009-01-01
Google
$174.34
2009-01-02
Boeing
$172.61
2009-01-02
Amazon
$171.42
2009-01-02
Google
$170.04
✔️ Here, we have a long format where each observation is the combination of a date and a stock.
Tibbles
R works particularly well with tidy data
We store tidy data in data frames or tibbles
Tibbles are just fancier data frames (i.e., they have a few extra features)
To use tibbles, we need the tidyverse package
Tibbles are constructed from one or more vectors
The vectors must have the same length
They can contain different types of data
Vectors
We start with three separate vector objects that all have the same length.
We set it up so that the \(n\)-th car in each train corresponds to the same observation.
Tibble
Then we combine the vectors into a single tibble (or data frame) object.
Now, as the tibble moves around, the variables always stay together.
Tibbles Live Coding
# SETUP: Install and load the tidyverse package# Extras pane > Packages tab > Installlibrary(tidyverse)# ==============================================================================# LESSON: Create a tibble from vectorsx <-c(10, 20, 30, 40)xy <- x *2-4ymy_tibble <-tibble(x, y)my_tibble# ==============================================================================# USECASE: You can mix different types of vectors in a single tibblefirst_names <-c("Adam", "Billy", "Caitlyn", "Debra")age_years <-c(12, 13, 10, NA)guests <-tibble(first_names, age_years)guests# ==============================================================================# TIP: To save time, you can also create the vectors in the tibble callgradebook <-tibble(grade =c(95, 83, 90, 76),letter =c("a", "b", "a-", "c"))gradebook# ==============================================================================# PITFALL: Don't try to combine tibbles with different lengthsy <-c(1, 2, 3)x <-c("a", "b")tibble(y, x) #error# ==============================================================================# LESSON: However, the exception is R will "recycle" a single valuetibble(y, x ="a")# ==============================================================================# LESSON: You can "extract" a vector from a tibble using $mytibble <-tibble(x =c(1, 2, 3, 4, 5), y ="test")mytibble$xmytibble$y# ==============================================================================# PITFALL: Don't try to extract a vector that doesn't existmytibble$z #error
Importing and Exporting
Data is usually stored in data files
Importing files into R is called reading
Exporting files from R is called writing
A convenient data file type is a CSV
This stands for comma-separated values
A CSV file is easy to share with other people
The tidyverse package can read/write CSVs
Other packages can read/write other types (e.g., readxl, haven, rio, googlesheets4)
Read/Write Live Coding
# SETUP: Load the tidyverse package (if you haven't yet)library(tidyverse)# ==============================================================================# USECASE: Create a tibble and write it to a filegradebook <-tibble(id =c(123, 456, 789),grade =c("A", "B", "A"))gradebookwrite_csv(gradebook, file ="gradebook.csv")# NOTE: You can see the new file in Extras pane > Files tab.# You can open the file in another program (e.g., Microsoft Excel).# You can also email this file to someone else to share it.# ==============================================================================# PITFALL: Don't swap the order of the tibble and the filewrite_csv("gradebook.csv", gradebook) # error# ==============================================================================# USECASE: Read in a file containing dataold_gradebook <-read_csv("gradebook.csv")old_gradebook# NOTE: read_csv() will examine and guess the data type of each variable.# You can tell it the data type of each variable, but that is more advanced.# ==============================================================================# PITFALL: Don't use the read.csv() and write.csv() functionsold_gradebook <-read.csv("gradebook.csv") # not a tibbleold_gradebook
Wrangle II
Basic wrangling verbs
tidyverse provides tools for wrangling tibbles
These functions are named after verbs
So if you name your objects after nouns…
…your code becomes easier to read
Noun(noun) ❌
Verb(noun) ✔️
blender(fruit)
blend(fruit)
screwdriver(screw)
drive(screw)
boxcutter(box)
cut(box)
Basic wrangling verbs
Primary Functions (most used)
select() retains only certain columns
mutate() adds or transforms columns
filter() retains rows based on criteria
Secondary Functions (less used)
arrange() sorts rows by their values
rename() changes column titles
relocate() moves columns around
Select Live Coding
# SETUP: Load package and inspect example tibblelibrary(tidyverse) # includes the dplyr packagestarwars# ==============================================================================# USECASE: Retain only the specified variablessw <-select(starwars, name)swsw <-select(starwars, name, sex, species)sw# ==============================================================================# PITFALL: Don't forget to save the change with assignmentselect(starwars, name, sex, species)starwars # still includes all variables# ==============================================================================# USECASE: Change the order of variablessw <-select(starwars, species, name, sex)sw# ==============================================================================# USECASE: Retain all variables between two variablessw <-select(starwars, name, hair_color:eye_color)sw# ==============================================================================# USECASE: Retain all variables except the specified onessw <-select(starwars, -sex, -species)swsw <-select(starwars, -c(sex, species))swsw <-select(starwars, -c(hair_color:starships))sw
Rename Live Coding
# USECASE: Change the name of one or more variables# TEMPLATE: df2 <- rename(df, new_name = old_name)starwarssw <-rename(starwars, Character = name)swsw <-rename(starwars, height_cm = height, mass_kg = mass)sw# ==============================================================================# PITFALL: Don't swap the order and try old_name = new_namesw <-rename(starwars, name = Character) # error
Relocate Live Coding
# USECASE: Move variables before another variable or positionstarwarssw <-relocate(starwars, sex, .before = height)swsw <-relocate(starwars, species, sex, .before = name)swsw <-relocate(starwars, homeworld, .before =1)sw# ==============================================================================# PITFALL: Don't forget the period!sw <-relocate(starwars, sex, before = height) sw # height was accidentally renamed to before# ==============================================================================# USECASE: Move variables after another variable or positionsw <-relocate(starwars, sex, .after = height)swsw <-relocate(starwars, species, sex, .after = name)swsw <-relocate(starwars, homeworld, .after =1)sw
Arrange Live Coding
# USECASE: Sort observations by a variablestarwarssw <-arrange(starwars, height)sw # sorted by height, ascendingsw <-arrange(starwars, name)sw # sorted by name, alphabetically# ==============================================================================# USECASE: Sort observations by a variable, in reverse ordersw <-arrange(starwars, desc(height))sw # sorted by height, descendingsw <-arrange(starwars, desc(name))sw # sorted by name, reverse-alphabetically# ==============================================================================# USECASE: Sort observations by multiple variablessw <-arrange(starwars, hair_color, mass)sw # sorted by hair_color, then ties broken by mass
Filter Live Coding
# USECASE: Retain only observations that meet a criterionsw <-filter(starwars, mass >100)sw # only observations with mass greater than 100sw <-filter(starwars, mass <=100)sw # only observations with mass less than or equal to 100sw <-filter(starwars, species =="Human")sw # only observations with species equal to Humansw <-filter(starwars, species !="Human")sw # only observations with species not equal to Human# ==============================================================================# PITFALL: Don't try to use a single = for testing equalitysw <-filter(starwars, height =150) # errorsw <-filter(starwars, height ==150) # correctsw # ==============================================================================# PITFALL: Don't forget that R is case-sensitivesw <-filter(starwars, species =="human")sw # no observations left (because it should be Human)# ==============================================================================# USECASE: Retain only observations that meet complex criteriasw <-filter(starwars, mass >100& height >200)sw # only observations with mass over 100 AND height over 200sw <-filter(starwars, height <100| hair_color =="none")sw # only observations with height under 100 OR hair_color equal to none# ==============================================================================# PITFALL: Don't forget to complete both conditionssw <-filter(starwars, mass >100&<200) # errorsw <-filter(starwars, mass >100& mass <200) # correctsw# ==============================================================================# PITFALL: Don't try to equate a string to a vectorsw <-filter(starwars, species ==c("Human", "Droid")) # errorsw <-filter(starwars, species %in%c("Human", "Droid")) # correctsw
# SETUP: Enable the pipe operator shortcut# Tools > Global Options... > Code tab > Check "Use Native Pipe Operator"# Type out |> or press Ctrl+Shift+M (Windows) / Cmd+Shift+M (Mac)# ==============================================================================# LESSON: The pipe pushes objects to a function as its first argument# TEMPLATE: x |> function_name() is the same as function_name(x)x <-10y <-sqrt(x)yy <- x |>sqrt()y# ==============================================================================# PITFALL: Don't forget to remove the object from the function callx |>sqrt(x) # wrongx |>sqrt() # correct# ==============================================================================# USECASE: You can still use arguments when pipingz <-round(3.14, digits =1)zz <-3.14|>round(digits =1)z# ==============================================================================# USECASE: Pipes are useful with tibbles and wrangling verbsstarwarssw <-select(starwars, name, species, height)swsw <- starwars |>select(name, species, height)sw# ==============================================================================# PITFALL: Don't add a pipe without a step after itsw <- starwars |>select(name, species, height) |># error
Pipelines Live Coding
# USECASE: You can chain multiple pipes together to make a pipelinex <-10|>sqrt() |>round()x# ==============================================================================# TIP: If you want to see the output of a pipeline, you can pipe to print()x <-10|>sqrt() |>round() |>print()# ==============================================================================# TIP: To make your pipelines more readable, move each step to a new linex <-10|>sqrt() |>round() |>print()# ==============================================================================# PITFALL: Don't put the pipe at the beginning of a line, thoughx <-10|>sqrt()|>round()|>print() # error# ==============================================================================# USECASE: Chain together a series of verbs to flexibly wrangle datatallones <- starwars |>select(name, species, height) |>rename(height_cm = height) |>mutate(height_ft = height_cm /30.48) |>filter(height_ft >7) |>arrange(desc(height_ft)) |>print()